Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions

Ongoing research attempts to benchmark large language models (LLM) against physicians’ fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exists on how knowledge supplied to the models, derived from medical texts improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program released by the American College of Physicians with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Mode inputs were augmented with Harrison’s Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5–80.7% outperforming GPT-3.5, human respondents, LaMDA and Llama 2 in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject with its highest and lowest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There is a 3.2–5.3% decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot. There is 4.5–7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions outperforming human respondents. Augmenting with domain-specific information improved performance rendering Retrieval Augmented Generation a possible technique for improving accuracy in medical examination LLM responses.


Introduction
Several recently released Large Language Models (LLMs) include GPT-3.5 and GPT-4.0 (OpenAI), LaMDA (Google) and Llama 2 (Meta), considered state-of-the-art due to their performance on common evaluation benchmarks (Table 1).The public chatbots ChatGPT, Bing Chat and Google Bard use GPT-3.5, GPT-4.0 and LaMDA, respectively [1].There has been a surge in the proposed medical applications of LLMs, including medical note taking and consultations [2].To that end, attempts have been made to benchmark LLM performance against practicing physicians' fund of knowledge.Research has evaluated their performance on the United States Medical Licensing Examination [3,4] as well as the ophthalmology [1], neonatal [5] and dermatology [6] board exams.Depending on the exam in question, LLMs performed below or above passing score rates and human respondents, raising the question of LLM performance on other subject matters.Additionally, the question of improving LLM performance with additional targeted augmentations also remains since these previous studies did not enhance the LLM systems with domain specific information.In this study, we investigate the Table 1.Large Language Model characteristics.A summary of model performance on common evaluation benchmarks along with their basic architecture and training setup, where available.

GPT-4 [14]
GPT-3.four aforementioned LLMs' performance with and without additional enhancements, on internal medicine (IM) board-style questions from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians [7].

Results
Each LLM's performance on 240 IM board-style questions is presented in Fig 1A .GPT-4.0, accessed either through Bing Chat or its application programming interface (API), scored 77.5-80.7%well above the reported 64% American Board of Internal Medicine (ABIM) exam passing score, outperforming GPT-3.5, LaMDA and Llama 2 in that order.Human MKSAP users averaged 63.6%, on par with GPT-3.5 accessed through ChatGPT or its API with additional input augmentation.There is a 3.2-5.3%decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its chatbot.There is 4.5-7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after  2. The highest and lowest percentile scores were in Infectious Disease (80th) and Rheumatology (99.7th), respectively.Additionally, GPT-4.0 outperformed both GPT-3.5 and LaMDA on all IM subjects with the exception of endocrinology and hematology.In nearly every IM subject, GPT-3.5 (ChatGPT) outperformed LaMDA (Bard), with the exception of Oncology and Neurology.
An IM board-certified, blinded reviewer correctly identified the human generated MKSAP response over the GPT-3.5 (ChatGPT) and GPT-4.0 (Bing Chat) responses in 72% of a 25-question sample set.In all but one of the incorrectly identified cases, the reviewer picked ChatGPT as the human response.The reviewer cited a writing style geared towards laypeople rather than IM residents and a comparatively greater number of references to IM subspecialty society guidelines, as the two primary ways of identifying the LLM-generated response.Factual accuracy and clinical reasoning reportedly did not inform the reviewer's decision.

Discussion
The LLMs and in particular GPT-4.0,exhibited high overall and subject-based performance, even without additional input augmentation, while their explanations contained valuable information from a medical education standpoint.Accessed through its API or corresponding chatbot, GPT-4.0 consistently outperformed human respondents and could be reasonably expected to pass the ABIM exam.The present study's limitations include the inability to test ABIM questions directly and the use of solely text-based questions.
The drop in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its chatbot is a key finding of this study.Differences in the underlying LLM system and/or the degree of model training between the API and chatbot versions could explain this observation but are difficult to ascertain given the "black box" nature of these models.[9] Regardless, the drop in LLM performance via the API was recovered through model input enhancements in the form of Harrison's Principles of Internal Medicine [8] for both GPT-3.5 and GPT-4.0.This equivalent performance is key, since an application built upon the LLM's API would not necessarily require internet access limiting ongoing concerns for loss of patient privacy or other sensitive information.Additionally, the 4.5-7.5% score increase following input augmentation with a relatively limited amount of domain specific information raises the possibility of greater improvements with additional augmentation.High yield, structured, domain specific information to assist with this task can be derived from publicly accessible IM subspecialty society websites (e.g., American College of Cardiology), which include updated practice guidelines and clinical trial summaries among other content.The subject-based performance (Fig 1B) analysis can further inform targeted LLM input enhancement.Model input augmentation can be focused on the worst performing subjects to ensure comparatively large improvements in overall examination performance.Each question answered incorrectly by an LLM can be considered a "hallucination."[2] Such "hallucinations" are stated in an assertive manner raising concerns of convincing human users of an erroneous answer.On the one hand, the LLMs failed a rudimentary "Turing"-like test as the blinded reviewer identified more than 50% of the human generated responses.On the other hand, the 28% failure rate provides further support regarding the similarity between LLM-and human-generated responses.To address this, ascribing a variable degree of confidence to each LLM response is not possible with current LLM designs, as the models were trained to emulate textual patterns rather than weigh each generated response according to their degree of factual accuracy [2,9].Bing Chat adds the caveat that the response is AI-generated and further consultation with experts is needed.Additionally, its provision of references allows human users to fact check its responses.The aforementioned illustration of performance improvement with model input augmentation using domain specific information (Fig 1A) could address this limitation by improving factual accuracy.
Despite its limitations, this study provides insights into the future application of LLMs to medical practice.The ability to answer an array of multidisciplinary IM questions while providing supporting explanations speaks to LLMs' potential as a study aid and medical assistant during hospital rounds.Additionally, the ability to reverse the drop in LLM performance when accessing its API through input augmentation with domain specific information will be key when developing LLM-based medical applications that seek to ensure patient privacy.Yet, we should be wary of possible improper uses by limiting access to such artificial intelligence tools during open-book ABIM assessments among physicians maintaining their ABIM board certification.

Question selection
MKSAP questions are similar in complexity to the multiple-choice questions from the ABIM standardized exam, spanning a broad scope of knowledge from diagnosis to medical management.MKSAP is commonly used by IM residents to prepare for the ABIM exam.The ABIM has no publicly available testing material and declined a formal request to release sample questions.In addition to MKSAP, the ACP administers the in-training exam (ITE) to IM residents, a yearly multiple-choice standardized exam used to track residents' educational growth.Performance on the ITE during the third year of IM residency is an independent predictor for passing the ABIM examination [10].The MKSAP questions included in this study were released after January 2022, outside the scope of training data in GPT-3.5 and GPT-4.0, which was confirmed by a Google index search prior to January 1, 2022.All questions containing clinical images were excluded, given the LLMs' inability to process visual information at the time of study.A random selection of 240 questions reflected the number of questions in an ABIM examination and comprised of all major subject areas (representing at least 4% of the ABIM exam [11]).Google and Bing web searches of each of the selected question prompts did not reveal any matching results.

Performance evaluation
All LLM queries took place between July 1 st and September 1 st , 2023.Three LLMs (GPT-3.5,GPT-4.0 and LaMDA) were accessed through their aforementioned publicly available chatbots.Each question was asked in a new chat with a cleared history.GPT-3.5 and GPT-4.0 were also accessed through the OpenAI API, specifically models gpt-3.5-turbo-16k-0613and gpt-4-0613, respectively.A quantized version of LlaMA-2-chat-7b (8-bit, GGML quantization) was hosted and accessed locally.The OpenAI API models were called at zero temperature to ensure maximal model output consistency while no conversational memory was retained between questions.These models were also additionally enhanced by providing context in the form of Harrison's Principles of Internal Medicine [8] through the method of Retrieval Augmented Generation (RAG) [12].Using this method, the model input is expanded to include parts of the provided document that are semantically similar to the exam question posed [13].More specifically, the first 170 pages of Harrison's Principles of Internal Medicine were discarded, as the information contained within was not relevant to the subject (editor information, table of contents, etc.).The filtered document was subsequently split recursively on certain characters (double and single newline character and space) in chunks of 750 characters with 150-character overlap.For each derived chunk, its embeddings were calculated using the text-embedding-ada-002 embedding model (OpenAI) and stored in a Chroma vector database.For each question, 90 of the most similar documents to the question were retrieved from the database to be passed as context.The document-question similarity metric used was cosine similarity.Both question and context were passed to the model wrapped in the following prompt: "'Use the following pieces of context to answer the question at the end.If you don't know the answer, just say that you don't know, don't try to make up an answer.{context} Question: {question} Helpful Answer: "' The RAG prompt used was the same as the one used for the simple API calls, with the addition of context-related instructions.Each question served as part of the LLM prompt, while both an incorrect answer and a declination to answer were considered incorrect.Separately, explanations to 25 correctly answered questions generated by ChatGPT and Bing Chat were randomly selected.The explanations were presented in a blinded fashion alongside the human generated MKSAP explanation to an IM board-certified physician, who had not previously used MKSAP.The blinded reviewer was tasked with selecting the human generated explanation.

Conclusions
Overall, GPT-4.0 performed best among the studied LLMs on IM board-style questions, outperforming human respondents and demonstrating promise for future medical applications.Augmentation with a limited amount of domain-specific information improved LLM performance rendering Retrieval Augmented Generation a viable technique for improving factual accuracy in medical examination LLM responses.

Fig 1 .
Fig 1. Model performance (A) Performance of four large language models (LLMs) evaluated by the percentage of correct answers obtained in 240 internal medicine board-style questions.Llama 2 was queried in the form of a quantized mode (8-bit quantization, GGML).(B) Performance evaluated by the percentage of correct answers obtained in each major subject area.https://doi.org/10.1371/journal.pdig.0000604.g001

Table 2 . Benchmark against human test takers.
Performance of Bing Chat (GPT-4.0) on internal medicine boardstyle questions in each major subject area expressed as a percentile among human test takers.